optimize sql query on 5B records table in postgresql database

Ask Time：2019-05-29T04:13:58 Author：ds3059

My PostgreSQL database has the table electrical_measurement, which contains approximately 5 billion records. I have indexes on every column. I am trying to run the following query, but it never finishes. How can I modify it to run faster?

SELECT
    em.id AS em_id,
    em.test_board_id_in,
    em.die,
    tvt_net.name,
    mb_pad_map.x,
    mb_pad_map.y,
    em.temperature,
    em.timestamp,
    em.avg_meas_voltage
FROM electrical_measurement AS em
INNER JOIN main_board_pad_map AS mb_pad_map
    ON em.net_id_in = mb_pad_map.net_id
INNER JOIN tvt_net
    ON em.net_id_in = tvt_net.id
WHERE em.assembly_id = 1
AND em.net_id_in IN 
 (SELECT em.net_id_in 
  FROM electrical_measurement AS em
  WHERE em.assembly_id = 1
  AND em.avg_meas_voltage > 0
  GROUP BY em.net_id_in)
ORDER BY em.timestamp

This is the result from EXPLAIN:

-------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather Merge  (cost=373158311.30..573643901.29 rows=1718327938 width=63)
   Workers Planned: 2
   ->  Sort  (cost=373157311.28..375305221.20 rows=859163969 width=63)
         Sort Key: em."timestamp"
         ->  Hash Join  (cost=84935808.04..171830022.94 rows=859163969 width=63)
               Hash Cond: (em.net_id_in = mb_pad_map.net_id)
               ->  Hash Join  (cost=84935424.26..161155613.60 rows=118993479 width=41)
                     Hash Cond: (em.net_id_in = em_1.net_id_in)
                     ->  Parallel Bitmap Heap Scan on electrical_measurement em  (cost=2996320.29..78903135.78 rows=118993479 width=37)
                           Recheck Cond: (assembly_id = 1)
                           ->  Bitmap Index Scan on electrical_measurement_assembly_id_idx  (cost=0.00..2924924.21 rows=285584350 width=0)
                                 Index Cond: (assembly_id = 1)
                     ->  Hash  (cost=81939087.68..81939087.68 rows=1303 width=4)
                           ->  HashAggregate  (cost=81939061.62..81939074.65 rows=1303 width=4)
                                 Group Key: em_1.net_id_in
                                 ->  Bitmap Heap Scan on electrical_measurement em_1  (cost=2953194.68..81656356.93 rows=113081878 width=4)
                                       Recheck Cond: (assembly_id = 1)
                                       Filter: (avg_meas_voltage > '0'::numeric)
                                       ->  Bitmap Index Scan on electrical_measurement_assembly_id_idx  (cost=0.00..2924924.21 rows=285584350 width=0)
                                             Index Cond: (assembly_id = 1)
               ->  Hash  (cost=266.17..266.17 rows=9408 width=38)
                     ->  Hash Join  (cost=42.32..266.17 rows=9408 width=38)
                           Hash Cond: (mb_pad_map.net_id = tvt_net.id)
                           ->  Seq Scan on main_board_pad_map mb_pad_map  (cost=0.00..199.08 rows=9408 width=16)
                           ->  Hash  (cost=26.03..26.03 rows=1303 width=22)
                                 ->  Seq Scan on tvt_net  (cost=0.00..26.03 rows=1303 width=22)
(26 rows)

Do you have any suggestions? Thanks

Author: ds3059, reproduced under the CC 4.0 BY-SA copyright license with a link to the original source and this disclaimer.
Link to original article：https://stackoverflow.com/questions/56349468/optimize-sql-query-on-5b-records-table-in-postgresql-database

Ancoron :

The sub-select is only one of your issues, but you could use EXISTS instead:

SELECT
    em.id AS em_id,
    em.test_board_id_in,
    em.die,
    tvt_net.name,
    mb_pad_map.x,
    mb_pad_map.y,
    em.temperature,
    em.timestamp,
    em.avg_meas_voltage
FROM electrical_measurement AS em
INNER JOIN main_board_pad_map AS mb_pad_map
    ON em.net_id_in = mb_pad_map.net_id
INNER JOIN tvt_net
    ON em.net_id_in = tvt_net.id
WHERE em.assembly_id = 1
AND EXISTS (SELECT 1
    FROM electrical_measurement AS tmp
    WHERE tmp.avg_meas_voltage > 0
    AND tmp.net_id_in = em.net_id_in)
ORDER BY em.timestamp

Then, you should have an index covering at least both net_id_in and avg_meas_voltage. That way, you should eliminate the Bitmap Heap Scan, the Group Key and the HashAggregate in one shot.

Last but not least, you are dealing with time-series information and you are querying all of the data, sorted by time. That is going to be really slow (most likely falling back to a disk sort instead of an in-memory one) with so many estimated rows (~1.7B).

If you really, really need all of the data from the beginning of time in your big table, and you really need to sort it, then make sure that you have separate low-latency, high-throughput storage available, create a tablespace on it, and set the temp_tablespaces option to it (disk sorts will then go there instead of the default tablespace).
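The index and tablespace advice above could be sketched roughly as follows. The index names, the tablespace name fast_temp and the directory path are hypothetical placeholders, not taken from the question. Note also that the EXISTS query above does not repeat the assembly_id = 1 condition inside the sub-select, whereas the question's original sub-select did. If that restriction is wanted, the second index variant (leading with assembly_id) is probably the better fit, together with an added tmp.assembly_id = 1 condition.

```sql
-- Sketch only: all object names and the path below are assumptions.

-- Composite index matching the EXISTS probe (net_id_in equality,
-- avg_meas_voltage range). CONCURRENTLY avoids blocking writes while the
-- index builds, but cannot be run inside a transaction block.
CREATE INDEX CONCURRENTLY electrical_measurement_net_voltage_idx
    ON electrical_measurement (net_id_in, avg_meas_voltage);

-- Alternative, if the sub-select is also restricted to one assembly
-- (tmp.assembly_id = 1), as in the question's original query.
CREATE INDEX CONCURRENTLY electrical_measurement_asm_net_voltage_idx
    ON electrical_measurement (assembly_id, net_id_in, avg_meas_voltage);

-- Dedicated tablespace for temporary files on fast storage. The directory
-- must already exist, be empty, and be owned by the PostgreSQL OS user.
CREATE TABLESPACE fast_temp LOCATION '/mnt/fast_storage/pg_temp';

-- Send temporary files (including on-disk sorts) for this session there.
SET temp_tablespaces = 'fast_temp';

-- Or make it the server-wide default and reload the configuration.
ALTER SYSTEM SET temp_tablespaces = 'fast_temp';
SELECT pg_reload_conf();
```

Building either index on a 5-billion-row table will itself take a long time, so it is worth checking with EXPLAIN afterwards that the planner actually uses it before relying on it.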

2019-05-29T00:38:25

Gordon Linoff :

You can try window functions:

SELECT . . .
FROM (SELECT em.*,
             COUNT(*) FILTER (WHERE em.assembly_id = 1 AND em.avg_meas_voltage > 0) OVER (PARTITION BY em.net_id_in) AS cnt
      FROM electrical_measurement em
     ) em JOIN
     main_board_pad_map mbpm
     ON em.net_id_in = mbpm.net_id JOIN
     tvt_net
     ON em.net_id_in = tvt_net.id
WHERE em.assembly_id = 1 AND
      cnt > 0
ORDER BY em.timestamp
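One possible variation on the window-function approach, offered as a sketch rather than a tested fix: move the assembly_id = 1 filter inside the derived table. As far as I know, PostgreSQL only pushes an outer WHERE condition below a window function when it refers to the PARTITION BY columns, so with the filter left outside, the window would be computed over all 5 billion rows. The select list is copied from the question; the alias e is introduced here for illustration.

```sql
SELECT
    em.id AS em_id,
    em.test_board_id_in,
    em.die,
    tvt_net.name,
    mbpm.x,
    mbpm.y,
    em.temperature,
    em.timestamp,
    em.avg_meas_voltage
FROM (
    -- Restrict to one assembly first, then count the positive-voltage
    -- rows per net across the remaining rows.
    SELECT e.*,
           COUNT(*) FILTER (WHERE e.avg_meas_voltage > 0)
               OVER (PARTITION BY e.net_id_in) AS cnt
    FROM electrical_measurement AS e
    WHERE e.assembly_id = 1
) AS em
JOIN main_board_pad_map AS mbpm
    ON em.net_id_in = mbpm.net_id
JOIN tvt_net
    ON em.net_id_in = tvt_net.id
WHERE em.cnt > 0          -- keep only nets with at least one positive reading
ORDER BY em.timestamp;
```

This should return the same rows as the version with the filter outside, because the outer query discards every row with a different assembly_id anyway. It still sorts every matching row by timestamp, so the sort cost remains.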

2019-05-28T21:47:19
